# Multimodal Image-Text Reasoning
## Llama 3.2 90B Vision Instruct
Llama 3.2-Vision is a multimodal large language model developed by Meta. It accepts image and text input, produces text output, and excels at visual recognition, image reasoning, image captioning, and visual question answering.

Tags: Image-to-Text · Transformers · Multilingual

Publisher: meta-llama
## Llama 3.2 11B Vision
Llama 3.2-Vision is a series of multimodal large language models developed by Meta, available at 11B and 90B scales. The models accept image and text input, produce text output, and are optimized for visual recognition, image reasoning, image captioning, and visual question answering.

Tags: Image-to-Text · Transformers · Multilingual

Publisher: meta-llama